
Add a new nni_list_node named a_finish_node in struct nng_aio #1695

Closed · 2 commits

Conversation

@willwu1217 (Author)

Assume that sub0_ctx_cancel and sub0_recv_cb execute concurrently.

After sub0_recv_cb releases the lock, but before the aio has been removed from the finish queue, nni_list_active in sub0_ctx_cancel may mistakenly conclude that the aio is still on recv_queue, triggering a second call to nni_list_remove. That second call results in a null pointer access.

So this change releases the lock only after the aio has been removed from the finish list.

fixes #1682
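To make the interleaving concrete, here is a simplified sketch of the two paths involved. It is an illustrative reconstruction based on the description above, not the verbatim nng source; the stub struct fields (s->lk, ctx->sock, ctx->recv_queue) are assumptions drawn from this discussion.

// Stub types standing in for the real ones in sub.c (fields assumed):
typedef struct sub0_sock { nni_mtx lk; } sub0_sock;
typedef struct sub0_ctx {
	struct sub0_sock *sock;
	nni_list          recv_queue;
} sub0_ctx;

// Thread A: simplified from sub0_recv_cb.
static void
recv_cb_sketch(sub0_sock *s, sub0_ctx *ctx, nni_aio *aio, nni_msg *msg)
{
	nni_list finish;
	nni_aio_list_init(&finish);

	nni_mtx_lock(&s->lk);
	nni_list_remove(&ctx->recv_queue, aio); // unlink from recv_queue
	nni_list_append(&finish, aio);          // relink on the local finish list
	nni_mtx_unlock(&s->lk);                 // <-- the race window opens here

	// The aio's list node is linked again (on `finish`), and
	// nni_list_active cannot tell which list a node is on.
	nni_list_remove(&finish, aio);
	nni_aio_set_msg(aio, msg);
	nni_aio_finish_sync(aio, 0, nni_msg_len(msg));
}

// Thread B: simplified from sub0_ctx_cancel, running inside the window.
static void
cancel_sketch(nni_aio *aio, void *arg, int rv)
{
	sub0_ctx  *ctx = arg;
	sub0_sock *s   = ctx->sock;

	nni_mtx_lock(&s->lk);
	if (nni_list_active(&ctx->recv_queue, aio)) { // misreports true: the
		nni_list_remove(&ctx->recv_queue, aio);   // node is on `finish`.
		nni_aio_finish_error(aio, rv);            // Whichever remove runs
	}                                             // second hits a NULL deref.
	nni_mtx_unlock(&s->lk);
}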

@gdamore (Contributor) left a comment


I don't think this will work in all circumstances. In particular I'm worried about a deadlock situation when the callback on the aio tries to resubmit the aio for receive. This is actually a fairly common idiom.

I think in that case, you'd wind up with an incorrect attempt to reenter the socket lock, resulting in a deadlock.

The fact that you've located the root cause here (the fact that nni_list_active is misreporting here) is super useful. I'm going to think about this because clearly that test is busted in this case -- your excellent sleuthing suggests several other solutions might be possible.
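For illustration, here is a sketch of the resubmit idiom being referred to. The state struct and callback name are hypothetical; the calls are the public nng API.

#include <nng/nng.h>

typedef struct {
	nng_ctx  ctx;
	nng_aio *aio; // allocated via nng_aio_alloc(&st->aio, my_recv_cb, st)
} my_state;

static void
my_recv_cb(void *arg)
{
	my_state *st = arg;

	if (nng_aio_result(st->aio) == 0) {
		nng_msg *msg = nng_aio_get_msg(st->aio);
		// ... consume the message ...
		nng_msg_free(msg);
	}
	// Re-arm the receive from inside the completion callback.  If the
	// protocol were still holding its socket lock while running this
	// callback, nng_ctx_recv() would try to re-acquire that lock and
	// deadlock -- the concern with unlocking after the finish.
	nng_ctx_recv(st->ctx, st->aio);
}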

@gdamore self-requested a review October 24, 2023 15:34
@gdamore (Contributor) left a comment


Accidentally hit approve, changing to request changes.

@willwu1217 (Author)

> I don't think this will work in all circumstances. In particular I'm worried about a deadlock situation when the callback on the aio tries to resubmit the aio for receive. This is actually a fairly common idiom.
>
> I think in that case, you'd wind up with an incorrect attempt to reenter the socket lock, resulting in a deadlock.
>
> The fact that you've located the root cause here (the fact that nni_list_active is misreporting here) is super useful. I'm going to think about this because clearly that test is busted in this case -- your excellent sleuthing suggests several other solutions might be possible.

I agree with you: the unlock needs to happen before nni_aio_finish_sync. Maybe we just need to store the finished aio without reusing the recv_queue list node:

1. Add a new finish nni_list_node in struct nng_aio.
2. Use a_finish_node to hold the finished aios in sub0_recv_cb, keeping them separate from the sub0_ctx recv_queue.
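A sketch of the proposed layout: a_prov_node is the existing provider-queue linkage in the aio, and a_finish_node is the field this PR adds (all other fields elided).

struct nng_aio {
	// ...
	nni_list_node a_prov_node;   // existing: linkage on provider lists such as recv_queue
	nni_list_node a_finish_node; // new: linkage on the local finish list in sub0_recv_cb
	// ...
};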
@willwu1217 (Author)

I have updated my PR:

1. Restored the original unlock position.
2. Added a new nni_list_node named a_finish_node in struct nng_aio to keep finished aios, avoiding the misreport from nni_list_active in sub0_ctx_cancel.

@willwu1217 changed the title from "Unlocking of sub0_sock lock after aio is deleted from the finish list" to "Add a new nni_list_node named a_finish_node in struct nng_aio" on Oct 25, 2023
@gdamore (Contributor) commented Nov 10, 2023

I need time to assess this properly -- sorry, I've been busy. The two areas of particular concern that I need to satisfy myself on are:

  1. Do other protocols have the same defect? (It seems likely that they do.)
  2. If they don't (and maybe even if they do), would it be better to locate this somewhere else?

I'm also concerned that increasing the size of the aio will have a deleterious effect for some uses. There are situations where people have huge numbers of aios, and they have complained about how big they are. This change increases the size by up to 16 bytes (an nni_list_node is two pointers, so 16 bytes on a 64-bit platform). It might be possible to make a much smaller increase, or even no increase at all, by using a flag.

Still, I think your work here is good, and I might wind up merging it all, but I want to have time to consider these issues first.

@ayizhi commented Nov 13, 2023

> I need time to assess this properly -- sorry, I've been busy. The two areas of particular concern that I need to satisfy myself on are:
>
>   1. Do other protocols have the same defect? (It seems likely that they do.)
>   2. If they don't (and maybe even if they do), would it be better to locate this somewhere else?
>
> I'm also concerned that increasing the size of the aio will have a deleterious effect for some uses. There are situations where people have huge numbers of aios, and they have complained about how big they are. This change increases the size by up to 16 bytes (an nni_list_node is two pointers, so 16 bytes on a 64-bit platform). It might be possible to make a much smaller increase, or even no increase at all, by using a flag.
>
> Still, I think your work here is good, and I might wind up merging it all, but I want to have time to consider these issues first.

Looking forward to a prompt release.

@gdamore (Contributor) commented Nov 26, 2023

Sorry, I've been a bit busy. I've been thinking about this problem, though -- and one idea I have is to use the existing node as a "singly linked" node, in a way that nni_list_node_active won't consider active. We can do this because we have no requirement for random access (or even random removal) on the completion queue, so we only need one of the two list pointers.
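A minimal sketch of that idea, with generic, hypothetical names (the merged fix, per the commit message below, reuses the aio's existing reap node rather than adding a pointer): thread completed aios through a single next pointer, so the doubly linked node used by recv_queue stays fully unlinked -- and nni_list_node_active stays false -- once the aio leaves the queue.

#include <stddef.h>

typedef struct toy_aio {
	struct toy_aio *next; // the single "completion chain" pointer
	// ... other aio state elided ...
} toy_aio;

typedef toy_aio *toy_completions; // head of a singly linked stack

// Push a completed aio; completion order need not be preserved,
// which is why a singly linked structure suffices.
static void
toy_completions_add(toy_completions *clp, toy_aio *aio)
{
	aio->next = *clp;
	*clp      = aio;
}

// Pop one aio so its completion can run outside the lock.
static toy_aio *
toy_completions_pop(toy_completions *clp)
{
	toy_aio *aio = *clp;
	if (aio != NULL) {
		*clp      = aio->next;
		aio->next = NULL;
	}
	return (aio);
}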

gdamore added a commit that referenced this pull request Nov 26, 2023
Credit goes to Wu Xuan (@willwu1217) for diagnosing the problem and
proposing a fix as part of #1695.  This change takes a revised approach
that avoids adding extra memory; it is also slightly faster, because by
reusing the reap node we do not need to update both pointers in the
linked list.

As part of this a new internal API, nni_aio_completions, is introduced.
In all likelihood we will be able to use this to solve some similar
crashes in other areas of the code.
@gdamore (Contributor) commented Nov 26, 2023

I'm closing this in favor of #1523 -- I think what I've done there is superior, but it was inspired by the excellent work you did here, and I've given what credit I can.

@gdamore closed this Nov 26, 2023
@willwu1217 (Author)

> I'm closing this in favor of #1523 -- I think what I've done there is superior, but it was inspired by the excellent work you did here, and I've given what credit I can.

I get your idea. Using an existing node as a singly linked node is indeed the better choice for both memory usage and efficiency. Since reaping only ever happens after finishing, we can be sure that aio completions and aio reaping never execute concurrently, so the node is safe to reuse.

One more small suggestion: maybe we could use an anonymous union in struct nng_aio:

union {
    nni_reap_node        a_reap_node;
    nni_completions_node a_completions_node;
};

@gdamore (Contributor) commented Nov 26, 2023

I am not sure that anonymous unions work in all the C versions we need to support (they were only standardized in C11), and I don't think they add enough value to be worth the pain their use might cause where they aren't supported.

@gdamore (Contributor) commented Nov 26, 2023

Also, by the way, I went ahead and merged my PR. I think there are other places in the code that suffer from the same defect, and I will try to find and fix them all.
